Self-healing using an alternate boot partition

ABSTRACT

Methods, apparatus and computer program products implement embodiments of the present invention that enable a computer system comprising networked computers to self-heal from a boot failure of one of the computers. In some embodiments, upon detecting a first computer failing to successfully load a first boot image, a second computer configures the first computer to boot a second boot image. Subsequent to configuring the first computer, the first computer is power cycled, and upon the power cycling, the first computer loads the second boot image.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a Continuation of U.S. patent application Ser. No. 13/830,019, filed on Mar. 14, 2013, and is related to U.S. patent application Ser. Nos. 13/829,612, 13/829,906, 13/830,081, and 13/830,153, each filed Mar. 14, 2013, and which are incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates generally to computer systems, and specifically to directing a computer to power cycle and boot from an alternate boot partition, also referred to herein as a rescue boot partition.

BACKGROUND

Operating systems manage the way software applications utilize the hardware of computer systems, such as storage controllers. A fundamental component of operating systems is the operating system kernel (also referred to herein as a “kernel”), which provides secure computer system hardware access to software applications executing on the computer system. Since accessing the hardware can be complex, kernels may implement a set of hardware abstractions to provide a clean and uniform interface to the underlying hardware. The abstractions provided by the kernel provide software developers easier access to the hardware when writing software applications.

Two common techniques for rebooting (i.e., restarting) an operating system are a “cold boot” and a “warm boot”. During a cold boot, power to a computer system's volatile memory is cycled (i.e., turned off and then turned on), and the operating system is rebooted. Since power is cut off to the memory, any contents (i.e., software applications and data) stored in the memory prior to the cold boot are lost. During a warm boot, the operating system reboots while power is still applied to the volatile memory, thereby enabling the computer to skip some hardware initializations and resets. Additionally, during a warm boot the memory may be reset.

In addition to a warm boot and a cold boot, the Linux operating system offers a method of rapidly booting a new operating system kernel via the kexec function. The kexec function first loads a new kernel into memory and then immediately starts executing the new kernel. Using kexec to boot a new kernel is referred to as a “hot” boot/reboot, since the computer's memory is not reset during the boot.

The description above is presented as a general overview of related art in this field and should not be construed as an admission that any of the information it contains constitutes prior art against the present patent application.

SUMMARY

There is provided, in accordance with an embodiment of the present invention a method, including configuring, using a second computer, a first computer having multiple boot images to boot one of the multiple boot images, subsequent to configuring the first computer, power cycling the first computer, and upon the power cycling, loading, by the first computer, the one of the multiple boot images.

There is also provided, in accordance with an embodiment of the present invention an apparatus, including a first computer including a boot device having multiple boot images, and a second computer coupled to the first computer and arranged to configure the first computer to boot one of the multiple boot images, and subsequent to configuring the first computer, to power cycle the first computer, wherein upon the power cycling, the first computer loads the one of the multiple boot images.

There is further provided, in accordance with an embodiment of the present invention a computer program product, the computer program product including a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code including computer readable program code executing on a second computer and arranged to configure a first computer having multiple boot images to boot one of multiple boot images, computer readable program code executing on the second computer and configured to power cycle the first computer subsequent to configuring the first computer, and computer readable program code executing on the first computer and configured to load the one of the multiple boot images upon the power cycling.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosure is herein described, by way of example only, with reference to the accompanying drawings, wherein:

FIG. 1 is a block diagram that schematically illustrates a storage system, in accordance with an embodiment of the present invention;

FIG. 2 is a block diagram of modules of the storage system configured to self-heal using a rescue boot image stored on a rescue boot partition, in accordance with an embodiment of the present invention; and

FIG. 3 is a flow diagram that schematically illustrates a method of self-healing using the rescue boot partition, in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS

In computing, a boot comprises an initial set of operations that a computer performs when electrical power is switched on (also referred to as power cycling). During a boot, a computer may load software components such as an operating system kernel, services and applications. The software components that are loaded during a boot are typically stored in a system startup configuration file. For example, during a boot, a computer configured as a storage system may load a Linux operating system kernel, a network TCP/IP service and a storage application configured to process input/output (I/O) requests received from one or more host computers.

Software components that are loaded during a boot can be stored on a boot device as a boot image. When a computer system boots, the boot image is retrieved and stored in memory as a software stack. In other words, a loaded software stack may comprise an in-memory representation of a corresponding boot image on a boot device.

In a computer network coupling a first computer to a second computer, there may be instances when, upon power being cycled to the first computer, the first computer fails to successfully boot a primary boot image from a boot device. Embodiments of the present invention provide methods and systems for the first computer to recover from the failed boot by reconfiguring the first computer to boot a rescue boot image from the boot device, and then power cycling the first computer to boot the rescue boot image. The rescue image may comprise a “safe factory default” boot image whose components have proven to be stable.

In some embodiments, the boot device may be divided into a primary boot partition (also referred to herein as a primary partition) configured to store the primary boot image and a rescue boot partition (also referred to herein as a rescue partition) configured to store the rescue boot image. Dividing the boot device into the primary and the rescue partitions enables the (one physical) boot device to function as multiple boot devices.

There may be instances when the second computer may detect that the first computer does not respond to any conveyed requests, possibly as a result of the first computer failing to successfully load and/or execute the primary boot image. In the embodiments described herein, the second computer can reconfigure the first computer to boot from the rescue partition, and then power cycle the first computer.

While the embodiments described herein relate generally to a storage system such as a clustered storage controller, it will be understood that embodiments of the present invention may also be used for other types of networked computer systems.

FIG. 1 is a block diagram that schematically illustrates a data processing storage subsystem 20, in accordance with an embodiment of the invention. The particular subsystem (also referred to herein as a storage system) shown in FIG. 1 is presented to facilitate an explanation of the invention. However, as the skilled artisan will appreciate, the invention can be practiced using other computing environments, such as other storage subsystems with diverse architectures and capabilities.

Storage subsystem 20 receives, from one or more host computers 22, input/output (I/O) requests, which are commands to read or write data at logical addresses on logical volumes. Any number of host computers 22 are coupled to storage subsystem 20 by any means known in the art, for example, using a network. Herein, by way of example, host computers 22 and storage subsystem 20 are assumed to be coupled by a Storage Area Network (SAN) 26 incorporating data connections 24 and Host Bus Adapters (HBAs) 28. The logical addresses specify a range of data blocks within a logical volume, each block herein being assumed by way of example to contain 512 bytes. For example, a 10 KB data record used in a data processing application on a given host computer 22 would require 20 blocks, which the given host computer might specify as being stored at a logical address comprising blocks 1,000 through 1,019 of a logical volume. Storage subsystem 20 may operate in, or as, a SAN system.

Storage subsystem 20 comprises a clustered storage controller 34 coupled between SAN 26 and a private network 46 using data connections 30 and 44, respectively, and incorporating adapters 32 and 42, again respectively. In some configurations, adapters 32 and 42 may comprise host bus adapters (HBAs). Clustered storage controller 34 implements clusters of storage modules 36, each of which includes an interface 38 (in communication between adapters 32 and 42), and a cache 40. Each storage module 36 is responsible for a number of storage devices 50 by way of a data connection 48 as shown.
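The block arithmetic in the example above can be checked with a short sketch. This is an illustration only, assuming the 512-byte block size stated in the text; the function names blocks_needed and block_range are hypothetical and do not appear in the described system.

```python
BLOCK_SIZE = 512  # bytes per block, as assumed by way of example in the text


def blocks_needed(record_bytes, block_size=BLOCK_SIZE):
    """Return the number of whole blocks needed to hold record_bytes."""
    # Ceiling division: a partially filled block still occupies a full block.
    return -(-record_bytes // block_size)


def block_range(first_block, record_bytes, block_size=BLOCK_SIZE):
    """Return the inclusive (first, last) block numbers occupied by a record."""
    count = blocks_needed(record_bytes, block_size)
    return first_block, first_block + count - 1


record_bytes = 10 * 1024  # the 10 KB data record from the example
print(blocks_needed(record_bytes))      # number of 512-byte blocks required
print(block_range(1000, record_bytes))  # inclusive block range starting at 1,000
```

With a 10 KB record starting at block 1,000, the sketch yields 20 blocks spanning blocks 1,000 through 1,019, matching the figures given in the description.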

As described previously, each storage module 36 further comprises a given cache 40. However, it will be appreciated that the number of caches 40 used in storage subsystem 20 and in conjunction with clustered storage controller 34 may be any convenient number. While all caches 40 in storage subsystem 20 may operate in substantially the same manner and comprise substantially similar elements, this is not a requirement. Each of the caches 40 may be approximately equal in size and is assumed to be coupled, by way of example, in a one-to-one correspondence with a set of physical storage devices 50, which may comprise disks. Those skilled in the art will be able to adapt the description herein to caches of different sizes.

Each set of storage devices 50 comprises multiple slow and/or fast access time mass storage devices, herein below assumed to be multiple hard disks. FIG. 1 shows caches 40 coupled to respective sets of storage devices 50. In some configurations, the sets of storage devices 50 comprise one or more hard disks, which can have different performance characteristics. In response to an I/O command, a given cache 40, by way of example, may read or write data at addressable physical locations of a given storage device 50. In the embodiment shown in FIG. 1, caches 40 are able to exercise certain control functions over storage devices 50. These control functions may alternatively be realized by hardware devices such as disk controllers (not shown), which are linked to caches 40.

Each storage module 36 is operative to monitor its state, including the states of associated caches 40, and to transmit configuration information to other components of storage subsystem 20, for example, configuration changes that result in blocking intervals, or limit the rate at which I/O requests for the sets of physical storage are accepted.

Routing of commands and data from HBAs 28 to clustered storage controller 34 and to each cache 40 may be performed over a network and/or a switch. Herein, by way of example, HBAs 28 may be coupled to storage modules 36 by at least one switch (not shown) of SAN 26, which can be of any known type having a digital cross-connect function. Additionally or alternatively, HBAs 28 may be coupled to storage modules 36.

In some embodiments, data having contiguous logical addresses can be distributed among modules 36, and within the storage devices in each of the modules. Alternatively, the data can be distributed using other algorithms, e.g., byte or block interleaving. In general, this increases bandwidth, for instance, by allowing a volume in a SAN or a file in network attached storage to be read from or written to more than one given storage device 50 at a time. However, this technique requires coordination among the various storage devices, and in practice may require complex provisions for any failure of the storage devices, and a strategy for dealing with error checking information, e.g., a technique for storing parity information relating to distributed data. Indeed, when logical unit partitions are distributed in sufficiently small granularity, data associated with a single logical unit may span all of the storage devices 50.

While such hardware is not explicitly shown for purposes of illustrative simplicity, clustered storage controller 34 may be adapted for implementation in conjunction with certain hardware, such as a rack mount system, a midplane, and/or a backplane. Indeed, private network 46 in one embodiment may be implemented using a backplane. Additional hardware such as the aforementioned switches, processors, controllers, memory devices, and the like may also be incorporated into clustered storage controller 34 and elsewhere within storage subsystem 20, again as the skilled artisan will appreciate. Further, a variety of software components, operating systems, firmware, and the like may be integrated into one storage subsystem 20.

Storage devices 50 may comprise a combination of high capacity hard disk drives and solid state disk drives. In some embodiments each of storage devices 50 may comprise a logical storage device. In storage systems implementing the Small Computer System Interface (SCSI) protocol, the logical storage devices may be referred to as logical units, or LUNs. While each LUN can be addressed as a single logical unit, the LUN may comprise a combination of high capacity hard disk drives and/or solid state disk drives.

Examples of adapters 32 and 42 include switched fabric adapters such as Fibre Channel (FC) adapters, Internet Small Computer System Interface (iSCSI) adapters, Fibre Channel over Ethernet (FCoE) adapters and Infiniband™ adapters.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system”. Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Python, Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the functions/actions specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/actions specified in the flowchart and/or block diagram block or blocks.

Self-Healing from a Failed Boot

FIG. 2 is a block diagram of modules 36 configured to self-heal using a rescue boot image 60 stored on a rescue partition 62, in accordance with an embodiment of the present invention. In storage controller 34, the self-healing embodiments described herein enable a first given module 36 to automatically detect and correct software and/or hardware failures in a second given module 36.

In the description herein, modules 36 and their respective components and data connections (i.e., data connection 44) may be differentiated by appending a letter to the identifying numeral, so that modules 36 comprise a first module 36A and a second module 36B. Alternatively, a given module 36 may just be referred to as module 36. For purposes of clarity, not all components in module 36A are included in module 36B (i.e., in FIG. 2).

Module 36 comprises a module processor 64, a module memory 66, a non-volatile memory 67 and a boot device 70. In embodiments of the present invention, boot device 70 may comprise a storage device such as a hard disk, an optical disk, a flash device (such as Compact Flash, USB stick or SDCard) or a solid state drive (SSD). In the configuration shown in FIG. 2, boot device 70 comprises a primary partition 72 configured to store a primary boot image 74, and rescue partition 62 configured to store rescue boot image 60.

Non-volatile memory 67 comprises a BIOS 68 configured to store power-on self-test (POST) procedures 76. When power is cycled to module 36, processor 64 can be configured to execute POST procedures 76, which load a boot loader 78 to memory 66.

Boot loader 78 is typically stored on a master boot record of boot device 70. When processor 64 loads and starts executing boot loader 78, the boot loader can be configured to read a boot flag 80 from vital product data (VPD) 82 of adapter 42. VPD 82 may comprise non-volatile memory configured to store boot flag 80, plus any configuration and informational data for adapter 42. Adapters 42 that have VPD 82 include, but are not limited to, Fibre Channel and Infiniband™ adapters. Alternatively, module 36 may store boot flag 80 in any VPD or non-volatile Peripheral Component Interconnect (PCI) memory of a device coupled to processor 64. In an alternative embodiment, boot flag 80 may be stored in non-volatile memory 67 or non-volatile memory 104.

In some embodiments, boot loader 78 can be configured to load, depending on the value stored in boot flag 80, either primary boot image 74 or rescue boot image 60, and store the components of the loaded boot image to a software stack 84 in memory 66. Each of boot images 60 and 74 comprises an initial collection of components that boot loader 78 can load upon power being cycled to module 36. Rescue boot image 60 comprises a kernel 86, one or more services 88 and one or more applications 90. Primary boot image 74 comprises a kernel 92, one or more services 94 and one or more applications 96.
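The selection performed by boot loader 78 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described herein: the flag values BOOT_PRIMARY and BOOT_RESCUE, the BootImage type and both function names are hypothetical, since the description does not specify how the first and second values of boot flag 80 are encoded.

```python
from dataclasses import dataclass, field

# Hypothetical encodings of the "first value" and "second value" of the boot flag.
BOOT_PRIMARY = 0
BOOT_RESCUE = 1


@dataclass
class BootImage:
    """A boot image: a kernel plus one or more services and applications."""
    kernel: str
    services: list = field(default_factory=list)
    applications: list = field(default_factory=list)


def select_boot_image(boot_flag, primary, rescue):
    """Pick the image to load based on the value read from the boot flag."""
    return rescue if boot_flag == BOOT_RESCUE else primary


def load_software_stack(image):
    """Copy the image's components into an in-memory software stack (a list)."""
    return [image.kernel, *image.services, *image.applications]


primary = BootImage("kernel-92", ["services-94"], ["applications-96"])
rescue = BootImage("kernel-86", ["services-88"], ["applications-90"])

# With the flag set to the rescue value, the rescue components are loaded.
stack = load_software_stack(select_boot_image(BOOT_RESCUE, primary, rescue))
print(stack)
```

Treating any value other than the rescue value as a request for the primary image is a design choice of this sketch; a real boot loader might instead reject unknown values.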

Rescue boot image 60 may comprise a “safe factory default” boot image that is considered to be stable, and primary boot image 74 may comprise an updated version of the rescue boot image. In embodiments of the present invention, upon failing to successfully load primary boot image 74, processor 64 loads rescue boot image 60 to software stack 84, and starts executing kernel 86, services 88 and applications 90 from the software stack in memory 66.

Module 36 also comprises a management controller 98 that is configured to monitor the module's operation, and to reconfigure hardware and/or software settings as necessary in order to optimize the module's performance. Management controller 98 comprises a management processor 100, a volatile memory 102 and a non-volatile memory 104. While managing module 36, management processor 100 may access (i.e., read data from and write data to) module memory 66 and VPD 82 (or any non-volatile PCI memory, as described supra). Additionally, management processor 100 may be configured to power cycle module 36, and to communicate with the other modules in storage controller 34.

Processor 64 typically comprises a general-purpose central processing unit (CPU), which is programmed in software to carry out the functions described herein. The software may be downloaded to module 36 in electronic form, over a network, for example, or it may be provided on non-transitory tangible media, such as optical, magnetic or electronic memory media. Alternatively, some or all of the functions of processor 64 may be carried out by dedicated or programmable digital hardware components, or using a combination of hardware and software elements.

Typically, management controller 98 is implemented as a “system-on-chip” (SOC), running an embedded software application. In such an implementation, the SOC may execute a software stack comprising a “standard” operating system (OS) and services (e.g., a Linux™ kernel and a web server) that are typically not user-upgradeable. The SOC is typically dedicated (i.e., not general purpose) and may be tightly controlled by a vendor. In other words, upgrades are typically provided by the vendor (or manufacturer), and may be considered “system firmware”, similar to BIOS 68.

The SOC may function as a robust, self-healing and self-sufficient system, configured to control processor 64 and the module processor's peripheral hardware, even when the controlled hardware malfunctions or crashes. This robustness may be possible because the hardware and software components of management controller 98 are typically designed to be self-sufficient and durable. Additionally, since management controller 98 may be configured to run “controlled” software designed for a specific purpose (i.e., an end-user is typically not able to load the general purpose software stack to memory 102 or memory 104), the management controller can be more stable than kernel 86 running on processor 64, and therefore the management controller may be configured to control the module processor.

While the embodiments described herein have software stack 84 comprising kernel 86, services 88 and applications 90, any organized collection comprising any number of components in memory 66 is considered to be within the spirit and scope of the present invention. For example, the collection (e.g., software stack 84) may comprise only kernel 86.

FIG. 3 is a flow diagram that schematically illustrates a method for storage controller 34 to self-heal (i.e., from a failure of module 36A) using rescue boot image 60A, in accordance with an embodiment of the present invention. In the embodiments described herein, module 36A is initially configured (i.e., via a first value stored in boot flag 80A) to boot primary boot image 74A from primary partition 72A.

In an initial step 110, power is cycled to module 36A, and processor 64A fails to boot (i.e., load and execute) primary boot image 74A from primary partition 72A. The boot failure may be a result of a corrupted primary boot image 74A or a problem with one of the physical regions storing the primary boot image on primary partition 72A. Alternatively, one of the software components in the primary boot image may fail to execute properly. For example, a given service 94A (e.g., a TCP/IP service) may have been recently upgraded, and the given service crashes upon being executed.

In a detect step 112, module 36B attempts to communicate with module 36A, and detects that module 36A is not responding. For example, processor 64B may convey, via a unicast transmission, a request to processor 64A, and not receive a reply within a given time period. Upon not receiving a reply, processor 64B, in a conveying step 114, conveys a reconfiguration message (e.g., via a second unicast transmission) to management controller 98A.
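Detect step 112 and conveying step 114 can be sketched as a timeout check followed by a message to the peer's management controller. This is an illustration only: the transport is abstracted behind two caller-supplied callables, and the function names, the message format {"action": "boot_rescue"} and the default timeout are all assumptions that the description does not specify.

```python
def is_responsive(send_request, timeout_s=2.0):
    """Return True if the peer replies to a request within timeout_s seconds.

    send_request is a hypothetical callable that conveys a unicast request,
    returns the reply, and raises TimeoutError if no reply arrives in time.
    """
    try:
        return send_request(timeout_s) is not None
    except TimeoutError:
        return False


def monitor_peer(send_request, send_reconfigure, timeout_s=2.0):
    """Detect an unresponsive peer and request a rescue boot.

    Returns True if a reconfiguration message was conveyed to the peer's
    management controller, or False if the peer answered normally.
    """
    if is_responsive(send_request, timeout_s):
        return False
    # Conveying step: ask the management controller to select the rescue image.
    send_reconfigure({"action": "boot_rescue"})
    return True
```

Passing the transport in as callables keeps the sketch independent of any particular network stack, so the detection logic can be exercised with simple stand-in functions.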

In a receive step 116, management processor 100A receives the reconfiguration message, and in a reconfiguration step 118, the management processor stores a second value to boot flag 80A, thereby instructing boot loader 78A to boot rescue boot image 60A from rescue partition 62A when power is cycled to module 36A. In a power cycling step 120, management processor 100A power cycles module 36A. When power is cycled to module 36A, processor 64A executes POST procedures 76A and loads boot loader 78A.

In a read step 122, boot loader 78A reads boot flag 80A. Since boot flag 80A currently stores the second value (see step 118) that indicates a request to boot from the rescue partition, in a load step 124, boot loader 78A accesses rescue partition 62A, retrieves rescue boot image 60A from the rescue partition and stores the components of the rescue boot image to software stack 84A in memory 66A.

In a boot step 126, processor 64A starts executing (i.e., boots) rescue boot kernel 86A, and in a start step 128, the processor starts executing rescue services 88A and rescue applications 90A. Finally, in a save step 130, processor 64A saves software stack 84A to primary boot image 74A, and the method ends. In some embodiments (e.g., if one or more regions in primary partition 72A were not readable), processor 64A may reformat the primary partition prior to saving software stack 84A to the primary partition. Additionally, after successfully saving software stack 84A to primary partition 72A, processor 64A may store the first value to boot flag 80A, thereby reconfiguring module 36A to boot from the primary partition.
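The overall flow of FIG. 3 can be simulated end to end with a toy model. The sketch below is illustrative and rests on several assumptions: the Module and ManagementController classes, the string flag values, the "corrupt" marker used to model a failed boot, and the self_heal helper are all hypothetical stand-ins for the hardware and firmware behavior described above.

```python
PRIMARY, RESCUE = "primary", "rescue"  # hypothetical boot flag values


class Module:
    """Toy model of a module whose boot device holds two partitions."""

    def __init__(self, primary_image, rescue_image):
        self.partitions = {PRIMARY: primary_image, RESCUE: rescue_image}
        self.boot_flag = PRIMARY    # first value: boot the primary image
        self.software_stack = None  # in-memory copy of the booted image

    def power_cycle(self):
        """Boot the image selected by the boot flag; return True on success."""
        self.software_stack = None  # memory contents are lost on a power cycle
        image = self.partitions[self.boot_flag]
        if image.get("corrupt"):
            return False            # models a failed load or execution
        self.software_stack = dict(image)
        if self.boot_flag == RESCUE:
            # Save step: write the running stack to the primary partition,
            # then restore the flag so the next boot uses the primary partition.
            self.partitions[PRIMARY] = dict(self.software_stack)
            self.boot_flag = PRIMARY
        return True


class ManagementController:
    """Toy model of the controller that sets the flag and cycles power."""

    def __init__(self, module):
        self.module = module

    def handle_reconfiguration(self):
        self.module.boot_flag = RESCUE    # reconfiguration step
        return self.module.power_cycle()  # power cycling step


def self_heal(module, controller):
    """Run by the peer module: returns 'healthy', 'healed' or 'failed'."""
    if module.software_stack is not None:
        return "healthy"
    return "healed" if controller.handle_reconfiguration() else "failed"
```

In this model, a module whose primary image is marked corrupt fails its first power cycle, is healed by a single call to self_heal, and afterwards boots from the repaired primary partition.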

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

It will be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.

The invention claimed is:
 1. A method, comprising: configuring, using a second computer, a first computer having multiple boot images to boot one of the multiple boot images, the multiple boot images stored each on a plurality of partitions; subsequent to configuring the first computer, power cycling the first computer; and upon the power cycling, loading, by the first computer, the one of the multiple boot images.
 2. The method according to claim 1, wherein each of the multiple boot images comprises one or more components that are selected from a list comprising an operating system kernel, a service, and a software application.
 3. The method according to claim 2, wherein the multiple boot images comprise a primary boot image stored on a primary partition of a boot device and a rescue boot image stored on a rescue partition of the storage device, and wherein the one of the multiple boot images comprises the rescue boot image.
 4. The method according to claim 3, wherein the second computer configures the first computer to load the rescue boot image in response to detecting the first computer failing to successfully load and execute the primary boot image.
 5. The method according to claim 4, wherein detecting the first computer failing to successfully load and execute the primary boot image comprises the first computer failing to respond to a request from the second computer.
 6. The method according to claim 2, wherein loading the one of the multiple boot images comprises retrieving the one or more software components from the one of the multiple boot images, storing the retrieved one or more components to a software stack in a memory, and executing the one or more components in the software stack.
 7. The method according to claim 6, and comprising saving the one or more components to the primary boot image, and reconfiguring the first computer to boot the primary boot image.
 8. The method according to claim 1, wherein configuring the first computer to boot the one of the multiple boot images comprises storing, to a non-volatile memory, a value indicating the one of the multiple boot images.
 9. The method according to claim 8, and comprising prior to loading the one of the multiple boot images, retrieving, from the non-volatile memory, the value indicating the one of the multiple boot images.